Unimodal Thompson Sampling for Graph-Structured Arms

Authors

  • Stefano Paladino
  • Francesco Trovò
  • Marcello Restelli
  • Nicola Gatti
Abstract

We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi–Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling–based algorithm whose asymptotic pseudo–regret matches the lower bound for the considered setting. We show that—as happens in a wide number of scenarios—Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our algorithm and of state–of–the–art ones as the properties of the graph vary.

Introduction

Multi–Armed Bandit (MAB) algorithms (Auer, Cesa-Bianchi, and Fischer 2002) have been proven to provide effective solutions for a wide range of applications fitting the sequential decision–making scenario. In this framework, at each round over a finite horizon T, the learner selects an action (usually called arm) from a finite set and observes only the reward corresponding to the choice she made. The goal of a MAB algorithm is to converge to the optimal arm, i.e., the one with the highest expected reward, while minimizing the loss incurred in the learning process; its performance is therefore measured through its expected regret, defined as the difference between the expected reward achieved by an oracle algorithm always selecting the optimal arm and the one achieved by the considered algorithm.

We focus on the so–called Unimodal MAB (UMAB), introduced in (Combes and Proutiere 2014a), in which each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which node of the edge gives the largest expected reward (thus providing a partial ordering over the arm space). Furthermore, from any node there is a path leading to the unique node with the maximum expected reward, along which the expected reward is monotonically increasing. While the graph structure may (but need not) be known a priori by the UMAB algorithm, the relationship defined over the edges is discovered during the learning. In the present paper, we propose a novel algorithm relying on the Bayesian learning approach for a generic UMAB setting.

Models presenting a graph structure have become more and more interesting in recent years due to the spread of social networks. Indeed, the relationships among the entities of a social network have a natural graph structure. A practical problem in this scenario is the targeted advertisement problem, whose goal is to discover the part of the network that is interested in a given product. This task is heavily influenced by the graph structure, since in social networks people tend to have characteristics similar to those of their friends (i.e., neighbor nodes in the graph); therefore, the interests of people in a social network change smoothly, and neighboring nodes in the graph look similar to each other (McPherson, Smith-Lovin, and Cook 2001; Crandall et al. 2008).
More specifically, an advertiser aims at finding those users that maximize the ad expected revenue (i.e., the product between click probability and value per click), while at the same time reducing the number of times the advertisement is presented to people not interested in its content. Under the assumption of unimodal expected reward, the learner can move from low expected rewards to high ones just by climbing them in the graph, avoiding the need for a uniform exploration over all the graph nodes. This assumption reduces the complexity of the search for the optimal arm, since the learning algorithm can avoid pulling the arms corresponding to some subset of non–optimal nodes, thus reducing the regret. Other applications might benefit from this structure, e.g., recommender systems, which aim at matching items with the users who are likely to enjoy them. Similarly, the use of the unimodal graph structure might provide more meaningful recommendations without testing all the users in the social network. Finally, notice that unimodal problems with a single variable, e.g., in sequential pricing (Jia and Mannor 2011), bidding in online sponsored search auctions (Edelman and Ostrovsky 2007), and single–peak preferences in economics and voting settings (Mas-Colell, Whinston, and Green 1995), are graph–structured problems in which the graph is a line.

Frequentist approaches for UMAB with graph structure are proposed in (Jia and Mannor 2011) and (Combes and Proutiere 2014a). Jia and Mannor (2011) introduce the GLSE algorithm with a regret of order O(√T log(T)). However, GLSE performs better than classical bandit algorithms only when the number of arms is Θ(T). Combes and Proutiere (2014a) present the OSUB algorithm—based on KL-UCB—achieving an asymptotic regret of O(log(T)) and outperforming GLSE in settings with few arms.

To the best of our knowledge, no Bayesian approach has been proposed for unimodal bandit settings, including the UMAB setting we study. However, it is well known that Bayesian MAB algorithms—the most popular being Thompson Sampling (TS)—usually suffer from the same order of regret as the best frequentist ones (e.g., in unstructured settings (Kaufmann, Korda, and Munos 2012)), while outperforming the frequentist methods in a wide range of problems (e.g., in bandit problems without structure (Chapelle and Li 2011) and in bandit problems with budget (Xia et al. 2015)). Furthermore, in problems with structure, the classical Thompson Sampling (not exploiting the problem structure) may outperform frequentist algorithms exploiting the problem structure. For this reason, in this paper we explore Bayesian approaches for the UMAB setting; a minimal sketch of the classical TS baseline is given after the list of contributions below.

More precisely, we provide the following original contributions:

  • we design a novel Bayesian MAB algorithm, called UTS and based on the TS algorithm;
  • we derive a tight upper bound over the pseudo–regret of UTS, which asymptotically matches the lower bound for the UMAB setting;
  • we describe a wide experimental campaign, showing that the performance of UTS in applicative scenarios is better than that of state–of–the–art algorithms, and evaluating how the performance of the algorithms (ours and of the state of the art) varies as the graph structure properties vary.
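As anticipated above, the following is a minimal sketch of classical Thompson Sampling with Beta priors over Bernoulli arms, i.e., the structure-agnostic Bayesian baseline discussed in this introduction. It is not the paper's UTS algorithm, and all names, signatures, and parameter values here are illustrative.

```python
import random


def thompson_sampling(arm_means, horizon):
    """Classical Thompson Sampling with Beta(1, 1) priors over Bernoulli
    arms: sample once from each posterior, pull the arm with the largest
    sample, observe the reward, and update that arm's posterior."""
    k = len(arm_means)
    alpha = [1] * k  # prior successes + 1, per arm
    beta = [1] * k   # prior failures + 1, per arm
    pulls = []
    for _ in range(horizon):
        # One draw from each arm's Beta posterior; play the argmax.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        i = max(range(k), key=lambda j: samples[j])
        reward = 1 if random.random() < arm_means[i] else 0  # simulated feedback
        alpha[i] += reward
        beta[i] += 1 - reward
        pulls.append(i)
    return pulls


# Example: three Bernoulli arms; TS concentrates its pulls on the best arm.
print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000).count(2))
```

With Bernoulli rewards the Beta prior is conjugate, so the posterior update reduces to incrementing one of two counters per arm; note that this baseline samples over all arms at every round, without using any graph structure.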
Related work

Here, we mention the main works related to ours. Some works deal with unimodal reward functions in the continuous armed bandit setting (Jia and Mannor 2011; Combes and Proutiere 2014b; Kleinberg, Slivkins, and Upfal 2008). In (Jia and Mannor 2011), a successive elimination algorithm called LSE is proposed, achieving a regret of O(√T log T); in this case, assumptions on the minimum local decrease and increase of the expected reward are required. Combes and Proutiere (2014b) consider stochastic bandit problems with a continuous set of arms, where the expected reward is a continuous and unimodal function of the arm; they propose the SP algorithm, based on a stochastic pentachotomy procedure to narrow the search space. Unimodal MABs on metric spaces are studied in (Kleinberg, Slivkins, and Upfal 2008). An application–dependent solution for recommendation systems, which exploits graph similarity in social networks for targeted advertisement, has been proposed in (Valko et al. 2014). Similar information has been considered in (Caron and Bhagat 2013), where the problem of cold–start users (i.e., new users) is studied. Another type of structure considered in sequential games is the monotonicity of the conversion rate in the price (Trovò et al. 2015). Interestingly, the assumptions of monotonicity and unimodality are orthogonal, neither being a special case of the other; therefore, the results for the monotonic setting cannot be used in unimodal bandits. In (Alon et al. 2013; Mannor and Shamir 2011), a graph structure over the arm feedback in an adversarial setting is studied; more precisely, they assume correlation over the rewards and not over the expected values of the arms.

Problem Formulation

A learner receives as input a finite undirected graph G = (A, E), whose vertices A = {a1, . . . , aK}, with K ∈ N, correspond to the arms, and where an edge (ai, aj) ∈ E exists only if there is a direct partial order relationship between the expected rewards of arms ai and aj. The learner knows a priori the nodes and the edges (i.e., she knows the graph) but, for each edge, she does not know a priori which node of the edge has the largest expected reward (i.e., she does not know the ordering relationship). At each round t over a time horizon T ∈ N, the learner selects an arm ai and gains the corresponding reward xi,t. This reward is drawn from an i.i.d. random variable Xi,t (i.e., we consider a stochastic MAB setting) characterized by an unknown distribution Di with finite known support Ω ⊂ R (as customary in MAB settings, from now on we consider Ω ⊆ [0, 1]) and by an unknown expected value μi := E[Xi,t]. We assume that there is a single optimal arm, i.e., there exists a unique arm ai∗ s.t. its expected value μi∗ = maxi μi; for the sake of notation, we denote μi∗ by μ∗. Here we analyze a graph bandit setting with the unimodality property, defined as follows:

Definition 1. A graph unimodal MAB (UMAB) setting G = (A, E) is a graph bandit setting G s.t. for each sub–optimal arm ai, i ≠ i∗, there exists a finite path p = (i1 = i, . . . , im = i∗) s.t. μ_{i_k} < μ_{i_{k+1}} and (a_{i_k}, a_{i_{k+1}}) ∈ E for each k ∈ {1, . . . , m − 1}.

This definition assures that, if one is able to identify an increasing path of expected rewards in G, she is able to reach the optimal arm without getting stuck in local optima. Note that the unimodality property implies that the graph G is connected; therefore, we consider only connected graphs from here on. A policy U over a UMAB setting is a procedure that selects, at each round t, an arm a_{i_t} based on the history ht, i.e., the sequence of past selected arms and past rewards gained. The pseudo–regret RT(U) of a generic policy U over a UMAB setting is defined as:

RT(U) := T μ∗ − E[ Σ_{t=1}^{T} μ_{i_t} ],

where i_t is the index of the arm selected at round t and the expectation is taken with respect to the randomness of the rewards and of the policy.
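To make the formulation concrete, here is a small self-contained sketch that encodes a line-graph UMAB instance with Bernoulli arms (consistent with Ω ⊆ [0, 1]), checks the unimodality property of Definition 1 by greedy ascent, and evaluates the pseudo-regret of a fixed sequence of pulls. None of the names or numerical values below come from the paper; they are purely illustrative.

```python
# A line-graph UMAB instance with K = 5 Bernoulli arms (illustrative
# values): the expected reward increases up to the optimal arm
# (index 2) and decreases afterwards, so Definition 1 holds.
mu = [0.2, 0.5, 0.8, 0.6, 0.3]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # undirected edges (a_i, a_j)


def is_unimodal(mu, edges):
    """Check Definition 1 by greedy ascent: from every node, repeatedly
    move to the best strictly-better neighbor. On a finite graph this
    reaches the optimal arm from every start iff no sub-optimal arm is
    a local maximum, which is equivalent to the unimodality property."""
    neighbors = {i: set() for i in range(len(mu))}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    i_star = max(range(len(mu)), key=lambda i: mu[i])
    for start in range(len(mu)):
        i = start
        while i != i_star:
            better = [j for j in neighbors[i] if mu[j] > mu[i]]
            if not better:
                return False  # stuck in a sub-optimal local maximum
            i = max(better, key=lambda j: mu[j])  # each step strictly increases mu
    return True


def pseudo_regret(mu, pulls):
    """Pseudo-regret of a sequence of pulled arm indices:
    R_T = T * mu_star - sum over t of mu_{i_t}."""
    mu_star = max(mu)
    return len(pulls) * mu_star - sum(mu[i] for i in pulls)


assert is_unimodal(mu, edges)
print(pseudo_regret(mu, pulls=[0, 1, 2, 2, 2]))  # ~0.9
```

The greedy-ascent check also illustrates why unimodality helps a learner: climbing to a strictly better neighbor can never end anywhere but the optimal arm, which is exactly the exploration saving discussed in the introduction.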
